authors: Jakub, Luisa, Max and Stefan
date: 20/01/2019
On average the top 100 songs are:
##
## Call:
## lm(formula = loudness ~ energy + speechiness, data = dataset_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4517 -0.7169 0.0492 0.7903 2.4026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.5590 0.6163 -17.132 < 2e-16 ***
## energy 8.3505 0.8550 9.766 4.29e-16 ***
## speechiness -5.8741 1.2514 -4.694 8.80e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.159 on 97 degrees of freedom
## Multiple R-squared: 0.5949, Adjusted R-squared: 0.5865
## F-statistic: 71.22 on 2 and 97 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = valence ~ danceability + loudness, data = dataset_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51835 -0.14305 0.01909 0.12715 0.39991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29345 0.11924 2.461 0.0156 *
## danceability 0.69742 0.14420 4.836 4.97e-06 ***
## loudness 0.04642 0.01001 4.638 1.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1792 on 97 degrees of freedom
## Multiple R-squared: 0.3282, Adjusted R-squared: 0.3144
## F-statistic: 23.7 on 2 and 97 DF, p-value: 4.168e-09
## PC1 PC2 PC3 PC4 PC5 PC6
## danceability -0.181 0.477 -0.143 0.459 -0.045 -0.229
## energy -0.490 -0.326 -0.040 0.018 0.014 -0.187
## key 0.027 -0.018 -0.499 -0.450 -0.090 -0.306
## loudness -0.574 -0.165 0.077 -0.098 0.043 -0.025
## mode 0.121 -0.027 0.561 0.249 0.251 -0.004
## speechiness 0.319 0.084 -0.423 0.443 0.045 -0.109
## acousticness 0.072 0.435 0.242 -0.354 -0.083 0.292
## instrumentalness 0.015 -0.209 0.265 0.250 -0.642 -0.190
## liveness -0.084 -0.207 -0.153 0.186 0.597 0.274
## valence -0.441 0.299 -0.026 0.218 -0.023 -0.042
## tempo 0.228 -0.510 -0.028 0.147 -0.073 -0.023
## duration_min 0.144 0.079 0.275 -0.166 0.375 -0.780
## 1.524 1.329 1.119 1.070 1.019 0.982
Based on eigenvalues > 1, we select 3 main factors (a factor must explain more variance than an individual variable does).
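The Kaiser criterion used above can be checked with a correlation-based PCA; a minimal sketch, assuming (as in the KMO call below) that the audio features sit in columns 4 to 15 of dataset_1:

```r
# Sketch: eigenvalues of the correlation matrix for the Kaiser criterion
# (assumes the audio features are in columns 4:15 of dataset_1)
pca <- prcomp(dataset_1[, 4:15], scale. = TRUE)  # scale. = TRUE -> correlation-based PCA
eigenvalues <- pca$sdev^2                        # variance explained by each component
which(eigenvalues > 1)                           # components passing the Kaiser criterion
```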
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = dataset_1[, c(4:15)])
## Overall MSA = 0.54
## MSA for each item =
## danceability energy key loudness
## 0.48 0.54 0.40 0.55
## mode speechiness acousticness instrumentalness
## 0.53 0.46 0.55 0.24
## liveness valence tempo duration_min
## 0.51 0.66 0.64 0.46
## [1] "PCA Condition 1: the KMO coefficient 0.536 must be above 0.5 in order to justify the appropriateness of PCA"
## [1] "PCA Condition 2: the p-value from the Bartlett test (0) must be below 0.05 to reject the null that the population correlation matrix is an identity matrix and continue with PCA"
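The two printed conditions come from the KMO measure shown above and a Bartlett test of sphericity; a hedged sketch of how they might be computed with the psych package (the same package as in the KMO call above):

```r
library(psych)
features <- dataset_1[, c(4:15)]                             # same columns as in the KMO call
KMO(features)$MSA                                            # overall MSA, should exceed 0.5
cortest.bartlett(cor(features), n = nrow(features))$p.value  # should fall below 0.05
```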
While this analysis provides some insights and supports our summary statistics, we will not elaborate further on the PCA, because the KMO coefficient (.54) is not very strong, indicating that PCA might not be the best approach in this case.
We identified a range of external factors that must be taken into account when looking at the number of appearances:
In 2017, leading artists (Ed Sheeran, Luis Fonsi, The Chainsmokers) released highly successful albums or singles.
As for Spotify itself, it is a streaming service whose users consume music: they want new songs, and it is precisely new songs that make it into the top 100 of the year. Consumers want to consume, so producers must produce in order to claim success.
| artist | no_of_appearance | genre |
|---|---|---|
| Ed Sheeran | 4 | pop |
| The Chainsmokers | 4 | pop,electropop |
| Drake | 3 | hip hop,R&B |
| Martin Garrix | 3 | electronic,house,progressive house |
| Bruno Mars | 2 | R&B,funk,pop,soul,reggae,hip hop,rock |
| Calvin Harris | 2 | electronic |
| Clean Bandit | 2 | electronic |
| DJ Khaled | 2 | hip hop |
| Imagine Dragons | 2 | pop rock |
| Kendrick Lamar | 2 | hip hop |
| Khalid | 2 | R&B |
| Luis Fonsi | 2 | latin,pop |
| Maroon 5 | 2 | pop,pop rock |
| Post Malone | 2 | hip hop |
| The Weeknd | 2 | R&B,pop,hip hop |
| ZAYN | 2 | R&B,pop |
## [1] "16 artists share 38 % of TOP 100 songs, making it 2.375 songs per super popular artist"
Note: genres were taken from the artists’ profiles on Wikipedia, as it was impossible to integrate the TOP100 dataset with the Spotify API.
The music genres of the most popular artists on Spotify are pop, hip hop and R&B. Could these be factors for writing a popular song?
## # A tibble: 5 x 2
## genre n
## <chr> <int>
## 1 pop 7
## 2 hip hop 6
## 3 r&b 5
## 4 electronic 3
## 5 pop rock 2
# popularity algorithm based on the number of streams and the actual number of days the song could have stayed in the top 200
master_stats <- master_dataset_gl %>%
  group_by(Id = id, Artist = artists, Song = name) %>%
  summarise(MinDate = min(Date),
            TotalStreams = sum(Streams),
            NumberOfDays = sum(count),
            StreamsPerDay = TotalStreams / NumberOfDays,
            Popularity = (TotalStreams / AllStreams) *
              (NumberOfDays / as.numeric(as.Date('2018-01-01') - MinDate))) %>%
  arrange(desc(TotalStreams))
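As a toy illustration of the Popularity formula (all numbers below are invented; AllStreams stands for the chart-wide stream total):

```r
# Hypothetical song: charted from 1 March 2017 onwards (invented figures)
TotalStreams  <- 5e7                      # invented stream count for the song
AllStreams    <- 1e10                     # invented chart-wide total
NumberOfDays  <- 250                      # days it actually appeared in the top 200
MinDate       <- as.Date("2017-03-01")
possible_days <- as.numeric(as.Date("2018-01-01") - MinDate)  # 306 days it could have appeared
(TotalStreams / AllStreams) * (NumberOfDays / possible_days)  # ~0.0041
```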
| feature | mean |
|---|---|
| danceability | 0.7027656 |
| energy | 0.6637505 |
| key | 5.0563206 |
| loudness | -5.4997869 |
| mode | 0.5532670 |
| speechiness | 0.0950520 |
| acousticness | 0.1808356 |
| instrumentalness | 0.0049967 |
| liveness | 0.1454561 |
| valence | 0.5470566 |
| tempo | 116.6150685 |
| duration_min | 3.6988909 |
rm(list=setdiff(ls(), "path"))
Meta:
Load the necessary packages that haven’t been loaded before, and the respective dataset first.
Before jumping into the analysis and answering questions around dataset II, we will try to understand the structure and content of this dataset.
We can see that we are dealing with a pretty long dataset consisting of more than 3.4 million rows and merely 7 columns. For each day of the year in a given region, the dataset records the top 200 artists, tracks and the number of respective streams.
summary(dataset_2)
## Position Track Name Artist Streams
## Min. : 1.00 Length:3441197 Length:3441197 Min. : 1001
## 1st Qu.: 45.00 Class :character Class :character 1st Qu.: 3322
## Median : 92.00 Mode :character Mode :character Median : 9227
## Mean : 94.64 Mean : 51892
## 3rd Qu.:143.00 3rd Qu.: 29658
## Max. :200.00 Max. :11381520
## URL Date Region
## Length:3441197 Length:3441197 Length:3441197
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
head(dataset_2)
## Position Track Name Artist Streams
## 1: 1 Reggaetón Lento (Bailemos) CNCO 19272
## 2: 2 Chantaje Shakira 19270
## 3: 3 Otra Vez (feat. J Balvin) Zion & Lennox 15761
## 4: 4 Vente Pa' Ca Ricky Martin 14954
## 5: 5 Safari J Balvin 14269
## 6: 6 La Bicicleta Carlos Vives 12843
## URL Date Region
## 1: https://open.spotify.com/track/3AEZUABDXNtecAOSC1qTfo 2017-01-01 ec
## 2: https://open.spotify.com/track/6mICuAdrwEjh6Y6lroV2Kg 2017-01-01 ec
## 3: https://open.spotify.com/track/3QwBODjSEzelZyVjxPOHdq 2017-01-01 ec
## 4: https://open.spotify.com/track/7DM4BPaS7uofFul3ywMe46 2017-01-01 ec
## 5: https://open.spotify.com/track/6rQSrBHf7HlZjtcMZ4S4bO 2017-01-01 ec
## 6: https://open.spotify.com/track/0sXvAOmXgjR2QUqLK1MltU 2017-01-01 ec
Considering the data provided, we are interested in making guesses about behavior over time and about relations between regions and continents. Taking a look at the structure, we see that the dataset is pretty clean, meaningfully named and doesn’t contain any NAs. To make life a little easier, we do a little bit of cleaning:
colnames(dataset_2) <- gsub(" ", "", colnames(dataset_2), fixed = TRUE) # remove blank spaces from column names
dataset_2$Date <- as.Date(dataset_2$Date) # parse as Date
daily_spotify <- copy(dataset_2) # copy into a new data.table, we need it later
Firstly, it could be interesting to check whether it is possible to guess the movement of a track within the top 200 and make predictions about its ranking tomorrow or even later. To address this question from a top-level perspective, we simply plot the movements of artists’ tracks over time. Since the US is Spotify’s biggest market, we will subset our data to this region to reduce dimensionality and complexity.
Now we can plot the movement of tracks for an artist and compare a few of those among the top 200.
plot_RankingPattern("Ed Sheeran")
plot_RankingPattern("Drake")
plot_RankingPattern("Post Malone")
plot_RankingPattern("Calvin Harris")
What we can see in the plots is that there are certain patterns, but the movement differs quite a bit from song to song and also seems to be influenced by the artist and other factors. For example, back in March 2017 when Ed Sheeran dropped his album, many tracks suddenly entered the top 200; some of them left after only a few days, while Shape of You survived till the end of the year, even getting boosted again towards the end (maybe due to Top 2017 playlists and year-in-music reviews). It seems that some artists are able to virally drop a lot of their releases at the top positions (Ed Sheeran, Drake), while others steadily climb up the charts.
A trivial approach to predicting tomorrow’s ranking based on the graphs would be to assume that the song will rank in the same place or a little below. Projecting a song’s ranking position for tomorrow seems too complicated, so we will skip this topic.
We want to calculate for how many days a certain song stays in the top X positions of the top 200. Looking at the plots above, we assume that a track slowly but continuously leaves the top X. We do not require a song to stay in the ranking continuously day by day; for the sake of simplicity we do a simple count.
The function will be run for various top rankings (Top 3, Top 5, …) and for all regions in the initial dataset. We enter all values into a new data.table and add the continent each country belongs to by applying the countrycode package. Finally, the figures are plotted for various Top-N rankings. We can immediately see that countries in the Americas listen to the same tracks for far longer than countries in Europe.
# Function that counts appearance of a track in a selected subset of TopN for a selected country
# Result is a mean over all tracks in selected subset
stay_in_ranking <- function(topN, region) {
  daily_N_region <- daily_spotify[Position < (topN + 1) & Region %in% region,
                                  .(Artist, Streams, TrackName, URL)]
  gtN <- daily_N_region %>% count(TrackName, sort = T)
  mean(gtN$n)
}
# We now want to apply this function to Top3, Top5,... and do this for every region
#We create a vector of all desired topN rankings and a columnName vector to build a data.table around that
rankingN<-c(3,5,10,15,20,50)
rankingName<-c("Top3","Top5","Top10","Top15","Top20","Top50")
all_regions<-unique(daily_spotify$Region)
region_ranking_persistance<-data.table(country=all_regions)
for (i in 1:length(rankingName)){
  region_ranking_persistance[, (rankingName[i]) := NA_real_] # parentheses: use the value of rankingName[i] as the column name
}
# now we can call the stay_in_ranking function for each cell in the data table respectively
# attention: this calls stay_in_ranking once per cell, so it may take a moment
for (i in 1:nrow(region_ranking_persistance)){
  region <- region_ranking_persistance[i, country] # extract the country code as a character value
  for (j in 2:ncol(region_ranking_persistance)){
    region_ranking_persistance[i, j] <- stay_in_ranking(rankingN[j - 1], region = region)
  }
}
#we add continent information to the result
region_ranking_persistance$continent<-countrycode(sourcevar = region_ranking_persistance[, country],
origin = "iso2c",
destination = "continent")
# and plot the data for the Top 5, Top10, Top 50
ggplot(region_ranking_persistance,aes(x=reorder(country,-Top5),y=Top5))+geom_bar(stat="identity",aes(fill=continent))+labs(title="Ranking Persistence of Songs over Countries",x="Country",y="Number of days a song stays in Top 5")+theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(region_ranking_persistance,aes(x=reorder(country,-Top10),y=Top10))+geom_bar(stat="identity",aes(fill=continent))+labs(title="Ranking Persistence of Songs over Countries",x="Country",y="Number of days a song stays in Top 10")+theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(region_ranking_persistance,aes(x=reorder(country,-Top50),y=Top50))+geom_bar(stat="identity",aes(fill=continent))+labs(title="Ranking Persistence of Songs over Countries",x="Country",y="Number of days a song stays in Top 50")+theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see that, globally, a successful song stays very high in the ranking for the majority of its time and then suddenly loses momentum. This matches the impressions from the first plots we made here. A successful song seems to virally enter the top positions, stay there for quite some time and then gradually descend.
The R package countrycode will help us retrieve the continent for each region, which is represented by an ISO2 code. We add the continent information to a new data.table.
# copy dataset and exclude all "global" regions
dataset_2_continents<-as.data.table(dataset_2[Region!="global"])
# add new column and add continent of respective ISO-code of country
dataset_2_continents$continent<-countrycode(sourcevar = dataset_2_continents[, Region],
origin = "iso2c",
destination = "continent")
With this information we can now calculate all possible combinations of two continents, which aren’t that many, since only four continents are included in the dataset. We do this by applying the combinations function of the gtools package to all unique continents in the dataset.
Subsequently we calculate the similarity between two continents both for unique tracks and for unique artists. We do this by dividing the size of the intersection of the two continents’ sets by the size of their union.
In the final result we can see that continents actually differ quite a lot: a similarity of around 20% is the maximum.
# gather all possible combinations of continents
continent_comparison<-as.data.table(combinations(length(unique(dataset_2_continents$continent))
,2,unique(dataset_2_continents$continent)))
# measure track similarity
# similarity: defined as the intersection set divided by the union set (Jaccard index)
for (i in 1:nrow(continent_comparison)){
continenta<-continent_comparison[i,V1]
continentb<-continent_comparison[i,V2]
subseta<-dataset_2_continents[continent %in% continenta]
subsetb<-dataset_2_continents[continent %in% continentb]
unique_a<-unique(subseta$TrackName)
unique_b<-unique(subsetb$TrackName)
continent_comparison[i,Track_similarity:=length(intersect(unique_a, unique_b))/
(length(unique_a)+length(unique_b)-length(intersect(unique_a, unique_b)))]
}
# measure Artist similarity
for (i in 1:nrow(continent_comparison)){
continenta<-continent_comparison[i,V1]
continentb<-continent_comparison[i,V2]
subseta<-dataset_2_continents[continent %in% continenta]
subsetb<-dataset_2_continents[continent %in% continentb]
unique_a<-unique(subseta$Artist)
unique_b<-unique(subsetb$Artist)
continent_comparison[i,Artist_similarity:=length(intersect(unique_a, unique_b))/
(length(unique_a)+length(unique_b)-length(intersect(unique_a, unique_b)))]
}
print(continent_comparison)
## V1 V2 Track_similarity Artist_similarity
## 1: Americas Asia 0.12406810 0.1301722
## 2: Americas Europe 0.13295790 0.1384055
## 3: Americas Oceania 0.19256757 0.2169648
## 4: Asia Europe 0.10115277 0.1076869
## 5: Asia Oceania 0.14378217 0.1455954
## 6: Europe Oceania 0.09522632 0.1123489
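The intersection-over-union used in both loops above is the Jaccard index; the duplicated loop bodies could be factored through one small helper (a sketch, not the code that produced the table above):

```r
# Jaccard similarity of two sets: |A intersect B| / |A union B|
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}
jaccard(c("x", "y", "z"), c("y", "z", "w"))  # 2 shared out of 4 distinct -> 0.5
```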
In order to answer this question, we need three sub-steps. First, we need a list of all possible combinations of two countries in the Spotify dataset. Second, we want to calculate the distance between all these possible combinations. And finally, we want to calculate the similarity and time-lag between the rankings of each pair of countries.
Our initial hypothesis is that we live in a globalized and connected world, which should also be reflected in pretty similar music rankings between countries and a fast global spread of new top tracks.
While R packages can help with the first two steps, we have to come up with metrics to compare two countries. We will calculate similarity in the same way we did above for continents. On top of that, we calculate a time-lag measure: the difference in days between the dates on which song X first appears in the ranking of country A vs. country B. Finally, we calculate the share of songs that appear in the top 200 on the same day in both countries (day_zero).
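On toy data the two appearance-based metrics work as follows (the first-appearance dates below are invented):

```r
library(data.table)
# invented first appearances of tracks in countries A and B
a <- data.table(TrackName = c("s1", "s2", "s3"),
                Date = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01")))
b <- data.table(TrackName = c("s1", "s2", "s4"),
                Date = as.Date(c("2017-01-01", "2017-02-04", "2017-05-01")))
ab <- merge(a, b, by = "TrackName")              # tracks charting in both countries
ab[, difference := as.numeric(Date.y - Date.x)]
mean(ab$difference)        # time-lag: mean delay in days, here 1.5
mean(ab$difference == 0)   # day_zero share, here 0.5
```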
We begin by calculating this for the top 10 streaming countries.
First step
Calculate all possible combinations of two countries from the top 10 streaming countries and store the result in a data.table. We obtain 45 possible combinations, for which we can then calculate the metrics described above.
# first subset. Top 10 Streaming countries
## obtain top 10 streaming countries
top10_countries<-daily_spotify[Region!="global",.(Sum=sum(Streams)),by=Region][order(-Sum)][1:10]
top10_countries<-top10_countries$Region
# get all possible combinations of two countries from this set
top10_countries_combinations<-as.data.table(combinations(length(unique(top10_countries)),2,unique(top10_countries)))
# 45 combinations in total!
setnames(top10_countries_combinations,c("V1","V2"),c("C1","C2"))
Second step
We need to calculate the distances for all the combinations above. The R packages rgeos, geosphere and rworldmap will help us do so: they provide coordinates of centroids for all countries in the world and allow us to calculate distances between two points. We store all distances in the table of country combinations calculated before.
# get world map data
wmap <- getMap(resolution="high")
# get centroids
centroids <- gCentroid(wmap, byid=TRUE)
# get a data.table with centroids
centroids_df <- as.data.frame(centroids)
centroids_dt<-as.data.table(centroids_df,keep.rownames=T)
# use iso code instead of country names and put them all in lower case
centroids_dt2<-centroids_dt
centroids_dt2$rn<-countrycode(sourcevar = centroids_dt2$rn,
origin = "country.name",
destination = "iso2c")
centroids_dt2$rn<-tolower(centroids_dt2$rn)
setnames(centroids_dt2,"rn","country")
# Calculation would work as follows
# distm(centroids_dt2[country=="de",.(x,y)], centroids_dt2[country=="us",.(x,y)], fun = distHaversine)[1,1]/1000
# only use rows with relevant countries
centroids_dt2 <- centroids_dt2[centroids_dt2$country %in% daily_spotify$Region, ]
Finally we can write the distance of each pair of countries into our initial table of all combinations. A for-loop is used to get the values in column 1 and 2 and hand them over as arguments to the distm function. Using basic data.table functions we add the respective distance in a new column.
# write distance into respective row of each pair
for (i in 1:nrow(top10_countries_combinations)) {
  top10_countries_combinations[i, distance := distm(
    centroids_dt2[country == top10_countries_combinations[i, C1], .(x, y)],
    centroids_dt2[country == top10_countries_combinations[i, C2], .(x, y)],
    fun = distHaversine)[1, 1] / 1000]
}
head(top10_countries_combinations,5)
## C1 C2 distance
## 1: au br 15889.69
## 2: au de 14574.49
## 3: au es 15855.09
## 4: au gb 15333.46
## 5: au mx 14320.34
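As a sanity check on the scale of these figures: distm(..., fun = distHaversine) returns great-circle distances on a sphere, and the underlying formula is short enough to sketch in base R (the city coordinates below are illustrative, not the package’s country centroids):

```r
# Great-circle (Haversine) distance in km on a sphere of radius 6371 km
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}
haversine_km(13.405, 52.52, 2.3522, 48.8566)  # Berlin -> Paris, roughly 880 km
```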
Third step
Now we can calculate all metrics and write them into a copy of the combinations table.
data_dataset2_top10 <- copy(top10_countries_combinations) # copy() so := below does not modify the original table by reference
for (i in 1:nrow(data_dataset2_top10)){
countrya<-data_dataset2_top10[i,C1]
countryb<-data_dataset2_top10[i,C2]
subseta<-dataset_2[Region %in% countrya]
subsetb<-dataset_2[Region %in% countryb]
rowIndexa<-as.numeric(match(unique(subseta$TrackName), subseta$TrackName))
rowIndexb<-as.numeric(match(unique(subsetb$TrackName), subsetb$TrackName))
dataset_a_date<-subseta[rowIndexa,.(TrackName,Date)]
dataset_b_date<-subsetb[rowIndexb,.(TrackName,Date)]
tracks_a_b<-merge(dataset_a_date,dataset_b_date,by='TrackName',all=F)
tracks_a_b[,difference:=(Date.y-Date.x)]
data_dataset2_top10[i,similarity:=length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName))/
(length(dataset_a_date$TrackName)+length(dataset_b_date$TrackName)-length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName)))]
data_dataset2_top10[i,day_zero:=length(which(tracks_a_b$difference==0))/length(tracks_a_b$difference)]
data_dataset2_top10[i,time_difference:=mean(tracks_a_b$difference)]
}
We plot our three metrics against distance after cleaning the data, and run regressions on them. On the 45 data points of the top 10 countries, the regression parameter suggests that similarity rises with distance, but the parameter is not significant (p ≈ 0.09). The same applies to the time-difference of ranking appearance. We could keep our initial hypothesis, but we assume that our dataset is a little small, since it only contains 10 countries. We therefore repeat the process for all regions in the initial data.
# prepare dataset for further analyses
data_dataset2_top10_clean<-unite(data_dataset2_top10, "country_combination",c(C1,C2), sep = "_", remove = T)
data_dataset2_top10_clean$time_difference<-as.numeric(data_dataset2_top10_clean$time_difference)
# plot scatter and do regressions
ggplot(data_dataset2_top10_clean,aes(distance,similarity))+geom_point()+geom_smooth(method='lm')+theme_bw()
ggplot(data_dataset2_top10_clean,aes(distance,day_zero))+geom_point()+geom_smooth(method='lm')+theme_bw()
ggplot(data=data_dataset2_top10_clean,aes(distance,time_difference))+geom_point()+geom_smooth(method='lm')+theme_bw()
m1=lm(data = data_dataset2_top10_clean,similarity~distance)
summary(m1)
##
## Call:
## lm(formula = similarity ~ distance, data = data_dataset2_top10_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.07935 -0.02867 -0.01185 0.02648 0.14461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.914e-01 1.650e-02 11.601 7.9e-15 ***
## distance 2.890e-06 1.684e-06 1.716 0.0934 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05571 on 43 degrees of freedom
## Multiple R-squared: 0.06407, Adjusted R-squared: 0.0423
## F-statistic: 2.943 on 1 and 43 DF, p-value: 0.09343
m2=lm(data = data_dataset2_top10_clean,day_zero~distance)
summary(m2)
##
## Call:
## lm(formula = day_zero ~ distance, data = data_dataset2_top10_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.080677 -0.031000 -0.003472 0.032734 0.079710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.036e-01 1.229e-02 49.11 < 2e-16 ***
## distance -4.730e-06 1.255e-06 -3.77 0.000494 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0415 on 43 degrees of freedom
## Multiple R-squared: 0.2484, Adjusted R-squared: 0.2309
## F-statistic: 14.21 on 1 and 43 DF, p-value: 0.0004937
m3=lm(data = data_dataset2_top10_clean,time_difference~distance)
summary(m3)
##
## Call:
## lm(formula = time_difference ~ distance, data = data_dataset2_top10_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6405 -4.8177 0.1416 5.0080 19.1421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.1456617 2.1029030 -1.496 0.142
## distance 0.0002027 0.0002147 0.944 0.350
##
## Residual standard error: 7.101 on 43 degrees of freedom
## Multiple R-squared: 0.02032, Adjusted R-squared: -0.002466
## F-statistic: 0.8918 on 1 and 43 DF, p-value: 0.3503
Firstly, we will calculate all possible combinations and write the distances between them into the resulting table. This time we obtain 1378 possible two-country combinations.
# excluding the region global from the data
dataset_2_nonglobal<-dataset_2[Region!="global"]
all_country_combinations<-as.data.table(combinations(length(unique(dataset_2_nonglobal$Region)),2,unique(dataset_2_nonglobal$Region)))
setnames(all_country_combinations,c("V1","V2"),c("C1","C2"))
# write distance into respective row of each pair
for (i in 1:nrow(all_country_combinations)) {
  all_country_combinations[i, distance := distm(
    centroids_dt2[country == all_country_combinations[i, C1], .(x, y)],
    centroids_dt2[country == all_country_combinations[i, C2], .(x, y)],
    fun = distHaversine)[1, 1] / 1000]
}
Next, our metrics are calculated for every combination and written into the data.table:
data_dataset2 <- copy(all_country_combinations) # copy() so := below does not modify the original table by reference
#### do the calculations above for all possible combinations
### attention to run-time: this took some minutes on a quad-core i7!
for (i in 1:nrow(data_dataset2)){
countrya<-data_dataset2[i,C1] #get country code in column 1 as value
countryb<-data_dataset2[i,C2] #get country code in column 2 as value
subseta<-dataset_2[Region %in% countrya] # subset initial dataset 2 by countrya
subsetb<-dataset_2[Region %in% countryb] # subset initial dataset 2 by countryb
# get row number of first appearance of a track for both datasets
rowIndexa<-as.numeric(match(unique(subseta$TrackName), subseta$TrackName))
rowIndexb<-as.numeric(match(unique(subsetb$TrackName), subsetb$TrackName))
#create a new subset which contains Trackname and date of 1st appearance for all tracks in country
dataset_a_date<-subseta[rowIndexa,.(TrackName,Date)]
dataset_b_date<-subsetb[rowIndexb,.(TrackName,Date)]
# merge tracks which exist in both countries into one table
tracks_a_b<-merge(dataset_a_date,dataset_b_date,by='TrackName',all=F)
#calculate time difference of first appearance of a track between the two countries in table
tracks_a_b[,difference:=(Date.y-Date.x)]
#calculate similarity
data_dataset2[i,similarity:=length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName))/
(length(dataset_a_date$TrackName)+length(dataset_b_date$TrackName)-length(intersect(dataset_a_date$TrackName, dataset_b_date$TrackName)))]
# calculate day_zero share and time_difference as mean of all time_differences
data_dataset2[i,day_zero:=length(which(tracks_a_b$difference==0))/length(tracks_a_b$difference)]
data_dataset2[i,time_difference:=mean(tracks_a_b$difference)]
}
Again, we can plot the results and run regressions on the scatter plots. This time we obtain a negative relationship between distance and similarity, and it is highly significant. The same applies to our other two metrics.
# prepare dataset for further analyses
data_dataset2_clean<-unite(data_dataset2, "country_combination",c(C1,C2), sep = "_", remove = T)
data_dataset2_clean$time_difference<-as.numeric(data_dataset2_clean$time_difference)
# plot scatter and do regressions
# relationship between distance and similarity
ggplot(data_dataset2_clean,aes(distance,similarity))+geom_point()+geom_smooth(method='lm')+theme_bw()
m=lm(data = data_dataset2_clean,similarity~distance)
summary(m)
##
## Call:
## lm(formula = similarity ~ distance, data = data_dataset2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22042 -0.07254 -0.02415 0.05047 0.39322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.985e-01 4.987e-03 59.86 <2e-16 ***
## distance -8.500e-06 5.264e-07 -16.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0995 on 1376 degrees of freedom
## Multiple R-squared: 0.1593, Adjusted R-squared: 0.1587
## F-statistic: 260.7 on 1 and 1376 DF, p-value: < 2.2e-16
# relationship between distance and time_lag
ggplot(data_dataset2_clean,aes(distance,time_difference))+geom_point()+geom_smooth(method='lm')+theme_bw()
m2=lm(data = data_dataset2_clean,time_difference~distance)
summary(m2)
##
## Call:
## lm(formula = time_difference ~ distance, data = data_dataset2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.5542 -6.0322 -0.4456 5.9101 23.4655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.859e+00 4.226e-01 4.400 1.17e-05 ***
## distance -2.299e-04 4.461e-05 -5.153 2.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.432 on 1376 degrees of freedom
## Multiple R-squared: 0.01893, Adjusted R-squared: 0.01822
## F-statistic: 26.55 on 1 and 1376 DF, p-value: 2.935e-07
# relationship between distance and share of same-day top200 entries
ggplot(data_dataset2_clean,aes(distance,day_zero))+geom_point()+geom_smooth(method='lm')+theme_bw()
m3=lm(data = data_dataset2_clean,day_zero~distance)
summary(m3)
##
## Call:
## lm(formula = day_zero ~ distance, data = data_dataset2_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40810 -0.04904 0.00383 0.06233 0.24190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.450e-01 4.495e-03 121.249 <2e-16 ***
## distance -4.623e-06 4.745e-07 -9.743 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08969 on 1376 degrees of freedom
## Multiple R-squared: 0.06453, Adjusted R-squared: 0.06385
## F-statistic: 94.92 on 1 and 1376 DF, p-value: < 2.2e-16
The best-fitting relationship, however, would not be a linear one. It also seems that similarity is only substantially higher for countries fairly close to each other. We could partition the data into two clusters of near and mid-to-high-distance country pairs. It seems that for countries which are close to each other, there are some combinations (potentially neighboring countries) that share more tracks in their top 200 than countries which are further apart.
## clustering
ggplot(data_dataset2_clean,aes(distance,similarity))+geom_point()+geom_smooth(method='loess')+theme_bw()
k<-2
clust_km<-kmeans(data_dataset2_clean[,c("distance","similarity")],k)
cluster_dataset2_clean <- cbind(data_dataset2_clean, cluster=factor(clust_km$cluster))
ggplot(cluster_dataset2_clean,aes(distance,similarity))+geom_point(aes(color=cluster))+theme(legend.position="none")
After the analysis of the second dataset described above we were able to derive some key findings:
Tracks move through the top 200 very differently. Some enter at top positions, rock the rankings for many weeks and then slowly descend. Others enter at a low position and gradually move up. There are also songs which re-enter the top 200 after having left it a few weeks before. Many unconsidered factors, such as release dates of albums, singles and videos, artist promotions, tour dates and curated playlists, potentially influence this movement.
The number of days a song stays in a top-ranking position heavily depends on the country you are looking at. Countries in the Americas tend to keep songs in the top rankings for a longer time than countries in Europe, for example. Maybe European countries are more influenced by American artists than the other way round.
The similarity of rankings between entire continents is pretty low.
For the entire dataset, distance does affect the similarity of the top 200 between two countries to some extent. However, this effect seems to be pronounced only up to a certain distance (~6,000 km); beyond that, distance does not really affect the similarity of the top 200 between two countries. We assume that countries within one continent share more songs in their rankings, which is represented in the top-left corner of the distance-vs-similarity chart.